{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# System Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. Install system dependencies and python packages. Prepare the environment.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, install all dependencies and Python packages as `root`. Run the commands below and make sure each installation succeeds.\n",
    "\n",
    "```bash\n",
    "apt update\n",
    "\n",
    "apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r` mailutils\n",
    "\n",
    "python3 -m pip install notebook==6.5.2\n",
    "python3 -m pip install jupyter_server==1.23.4\n",
    "python3 -m pip install jupyter_highlight_selected_word\n",
    "python3 -m pip install jupyter_contrib_nbextensions\n",
    "python3 -m pip install virtualenv==20.21.1\n",
    "python3 -m pip uninstall -y ipython\n",
    "python3 -m pip install ipython==8.21.0\n",
    "python3 -m pip uninstall -y traitlets\n",
    "python3 -m pip install traitlets==5.9.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Required for Ubuntu***\n",
    "\n",
    "Check that `/etc/hosts` has no entry mapping your hostname to 127.0.0.1 or 127.0.1.1 (Ubuntu is notorious for this). If there is one, delete it.\n",
    "Then add an `<ip> <hostname>` line for the master node and for each worker node.\n",
    "\n",
    "Example `/etc/hosts`:\n",
    "\n",
    "```\n",
    "127.0.0.1 localhost\n",
    "\n",
    "# The following lines are desirable for IPv6 capable hosts\n",
    "::1     ip6-localhost ip6-loopback\n",
    "fe00::0 ip6-localnet\n",
    "ff00::0 ip6-mcastprefix\n",
    "ff02::1 ip6-allnodes\n",
    "ff02::2 ip6-allrouters\n",
    "\n",
    "10.0.0.117 sr217\n",
    "10.0.0.113 sr213\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. Format and mount disks**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Python virtual environment to finish the system setup process:\n",
    "\n",
    "```bash\n",
    "virtualenv -p python3 -v venv\n",
    "source venv/bin/activate\n",
    "```\n",
    "\n",
    "Then install the required package inside `venv`:\n",
    "```bash\n",
    "(venv) python3 -m pip install questionary\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the script [init_disks.py](./init_disks.py) to format and mount the disks. **Be careful when choosing the disks to format.** If you see an error such as `device or resource busy`, the partition is probably already mounted; unmount it first. If the error persists, reboot the system and try again."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exit `venv`:\n",
    "```bash\n",
    "(venv) deactivate\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. Create user `sparkuser`**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the user `sparkuser` with no password and with sudo privileges. It is recommended to put the home directory on one of the data disks rather than on the system drive.\n",
    "\n",
    "```bash\n",
    "mkdir -p /data0/home/sparkuser\n",
    "ln -s /data0/home/sparkuser /home/sparkuser\n",
    "cp -r /etc/skel/. /home/sparkuser/\n",
    "adduser --home /home/sparkuser --disabled-password --gecos \"\" sparkuser\n",
    "\n",
    "chown -R sparkuser:sparkuser /data*\n",
    "\n",
    "echo 'sparkuser ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate SSH keys for `sparkuser`:\n",
    "\n",
    "```bash\n",
    "su - sparkuser\n",
    "```\n",
    "\n",
    "```bash\n",
    "rm -rf ~/.ssh\n",
    "ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n",
    "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys\n",
    "\n",
    "exit\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate SSH keys for `root`, and enable passwordless SSH from `sparkuser` to `root`:\n",
    "\n",
    "```bash\n",
    "rm -rf /root/.ssh\n",
    "ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa <<<y >/dev/null 2>&1\n",
    "cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n",
    "cat /home/sparkuser/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Log in as `sparkuser` and make a first SSH connection to `root` so that the host keys are accepted:\n",
    "\n",
    "```bash\n",
    "su - sparkuser\n",
    "```\n",
    "\n",
    "```bash\n",
    "ssh -o StrictHostKeyChecking=no root@localhost ls\n",
    "ssh -o StrictHostKeyChecking=no root@127.0.0.1 ls\n",
    "ssh -o StrictHostKeyChecking=no root@`hostname` ls\n",
    "```"
   ]
  },
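  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Required for Ubuntu***\n",
    "\n",
    "Run the command below to comment out the block that starts with `If not running interactively, don't do anything` in `~/.bashrc` (lines 5 to 9 of the default Ubuntu file). Otherwise, commands run over non-interactive SSH return before reaching the environment variables appended to `~/.bashrc` later.\n",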
    "\n",
    "```bash\n",
    "sed -i '5,9 s/^/# /' ~/.bashrc\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4. Configure jupyter notebook**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As `sparkuser`, install the Python packages:\n",
    "\n",
    "```bash\n",
    "cd /home/sparkuser/.local/lib/ && rm -rf python*\n",
    "\n",
    "python3 -m pip install --upgrade jsonschema\n",
    "python3 -m pip install jsonschema[format]\n",
    "python3 -m pip install sqlalchemy==1.4.46\n",
    "python3 -m pip install papermill Black\n",
    "python3 -m pip install NotebookScripter\n",
    "python3 -m pip install findspark spylon-kernel matplotlib pandasql pyhdfs\n",
    "python3 -m pip install ipywidgets jupyter_nbextensions_configurator ipyparallel\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Configure Jupyter Notebook. Set a password when prompted.\n",
    "\n",
    "```bash\n",
    "jupyter notebook --generate-config\n",
    "\n",
    "jupyter notebook password\n",
    "\n",
    "mkdir -p ~/.jupyter/custom/\n",
    "\n",
    "echo '.container { width:100% !important; }' >> ~/.jupyter/custom/custom.css\n",
    "\n",
    "echo 'div.output_stderr { background: #ffdd; display: none; }'  >> ~/.jupyter/custom/custom.css\n",
    "\n",
    "jupyter nbextension install --py jupyter_highlight_selected_word --user\n",
    "\n",
    "jupyter nbextension enable highlight_selected_word/main\n",
    "\n",
    "jupyter nbextension install --py widgetsnbextension --user\n",
    "\n",
    "jupyter contrib nbextension install --user\n",
    "\n",
    "jupyter nbextension enable codefolding/main\n",
    "\n",
    "jupyter nbextension enable code_prettify/code_prettify\n",
    "\n",
    "jupyter nbextension enable codefolding/edit\n",
    "\n",
    "jupyter nbextension enable code_font_size/code_font_size\n",
    "\n",
    "jupyter nbextension enable collapsible_headings/main\n",
    "\n",
    "jupyter nbextension enable highlight_selected_word/main\n",
    "\n",
    "jupyter nbextension enable ipyparallel/main\n",
    "\n",
    "jupyter nbextension enable move_selected_cells/main\n",
    "\n",
    "jupyter nbextension enable nbTranslate/main\n",
    "\n",
    "jupyter nbextension enable scratchpad/main\n",
    "\n",
    "jupyter nbextension enable tree-filter/index\n",
    "\n",
    "jupyter nbextension enable comment-uncomment/main\n",
    "\n",
    "jupyter nbextension enable export_embedded/main\n",
    "\n",
    "jupyter nbextension enable hide_header/main\n",
    "\n",
    "jupyter nbextension enable highlighter/highlighter\n",
    "\n",
    "jupyter nbextension enable scroll_down/main\n",
    "\n",
    "jupyter nbextension enable snippets/main\n",
    "\n",
    "jupyter nbextension enable toc2/main\n",
    "\n",
    "jupyter nbextension enable varInspector/main\n",
    "\n",
    "jupyter nbextension enable codefolding/edit\n",
    "\n",
    "jupyter nbextension enable contrib_nbextensions_help_item/main\n",
    "\n",
    "jupyter nbextension enable freeze/main\n",
    "\n",
    "jupyter nbextension enable hide_input/main\n",
    "\n",
    "jupyter nbextension enable jupyter-js-widgets/extension\n",
    "\n",
    "jupyter nbextension enable snippets_menu/main\n",
    "\n",
    "jupyter nbextension enable table_beautifier/main\n",
    "\n",
    "jupyter nbextension enable hide_input_all/main\n",
    "\n",
    "jupyter nbextension enable spellchecker/main\n",
    "\n",
    "jupyter nbextension enable toggle_all_line_numbers/main\n",
    "\n",
    "jupyter nbextensions_configurator enable --user\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clone Gluten\n",
    "\n",
    "```bash\n",
    "cd ~\n",
    "git clone https://github.com/apache/incubator-gluten.git gluten\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start jupyter notebook\n",
    "\n",
    "```bash\n",
    "mkdir -p ~/ipython\n",
    "cd ~/ipython\n",
    "\n",
    "nohup jupyter notebook --ip=0.0.0.0 --port=8888 &\n",
    "\n",
    "find ~/gluten/tools/workload/benchmark_velox/ -maxdepth 1 -type f -exec cp {} ~/ipython \\;\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Initialize\n",
    "<font color=red size=3> Run this section after notebook restart! </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Specify `datadir`. These directories are used for `spark.local.dir` and for Hadoop namenode/datanode storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "datadir=[f'/data{i}' for i in range(0, 8)]\n",
    "datadir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Specify the clients (workers). Leave the list empty if the cluster is set up on the local machine only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "clients=''''''.split()\n",
    "print(clients)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Specify JAVA_HOME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "java_home = '/usr/lib/jvm/java-8-openjdk-amd64'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import socket\n",
    "import platform\n",
    "\n",
    "user=os.getenv('USER')\n",
    "print(f\"user: {user}\")\n",
    "print()\n",
    "\n",
    "masterip=socket.gethostbyname(socket.gethostname())\n",
    "hostname=socket.gethostname()\n",
    "print(f\"masterip: {masterip} hostname: {hostname}\")\n",
    "print()\n",
    "\n",
    "hclients=clients.copy()\n",
    "hclients.append(hostname)\n",
    "print(f\"master and workers: {hclients}\")\n",
    "print()\n",
    "\n",
    "\n",
    "if clients:\n",
    "    # Query only the first client for its CPU count and memory size.\n",
    "    cmd = f\"ssh {clients[0]} \" + \"\\\"lscpu | grep '^CPU(s)'\\\"\" + \" | awk '{print $2}'\"\n",
    "    client_cpu = !{cmd}\n",
    "    cpu_num = int(client_cpu[0])  # convert to int to match os.cpu_count() in the local branch\n",
    "\n",
    "    cmd = f\"ssh {clients[0]} \" + \"\\\"cat /proc/meminfo | grep MemTotal\\\"\" + \" | awk '{print $2}'\"\n",
    "    totalmemory = !{cmd}\n",
    "    totalmemory = int(totalmemory[0])\n",
    "else:\n",
    "    cpu_num = os.cpu_count()\n",
    "    totalmemory = !cat /proc/meminfo | grep MemTotal | awk '{print $2}'\n",
    "    totalmemory = int(totalmemory[0])\n",
    "    \n",
    "print(f\"cpu_num: {cpu_num}\")\n",
    "print()\n",
    "\n",
    "print(\"total memory: \", totalmemory, \"KB\")\n",
    "print()\n",
    "\n",
    "mem_mib = int(totalmemory/1024)-1024\n",
    "print(f\"mem_mib: {mem_mib}\")\n",
    "print()\n",
    "\n",
    "is_arm = platform.machine() == 'aarch64'\n",
    "print(\"is_arm: \",is_arm)\n",
    "print()\n",
    "\n",
    "sparklocals=\",\".join([f'{l}/{user}/yarn/local' for l in datadir])\n",
    "print(f\"SPARK_LOCAL_DIR={sparklocals}\")\n",
    "print()\n",
    "\n",
    "%cd ~"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Set up clients\n",
    "<font color=red size=3> SKIP for single node </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Manually configure passwordless SSH login to all clients:\n",
    "\n",
    "```bash\n",
    "ssh-copy-id -o StrictHostKeyChecking=no root@<client ip>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh root@{l} apt update > /dev/null 2>&1\n",
    "    # Single quotes defer `uname -r` to the client, so linux-tools matches the client's kernel.\n",
    "    !ssh root@{l} 'apt install -y sudo locales wget tar tzdata git ccache cmake ninja-build build-essential llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven vim pip sysstat gcc-9 libjemalloc-dev nvme-cli curl zip unzip bison flex linux-tools-common linux-tools-generic linux-tools-`uname -r`' > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Create user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh -o StrictHostKeyChecking=no root@{l} ls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh root@{l} adduser --disabled-password --gecos '\"\"' {user}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh root@{l} cp -r .ssh /home/{user}/\n",
    "    !ssh root@{l} chown -R {user}:{user} /home/{user}/.ssh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh root@{l} \"echo '{user} ALL=(ALL:ALL) NOPASSWD:ALL' | EDITOR='tee -a' visudo\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "***Required for Ubuntu***\n",
    "\n",
    "Run the cell below to comment out the block that starts with `If not running interactively, don't do anything` in `~/.bashrc` on each client."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh {l} sed -i \"'5,9 s/^/# /'\" ~/.bashrc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Copy `/etc/hosts` from the master node to all clients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !scp /etc/hosts root@{l}:/etc/hosts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Setup disks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !ssh root@{l} apt update > /dev/null 2>&1\n",
    "    !ssh root@{l} apt install -y pip > /dev/null 2>&1\n",
    "    !ssh root@{l} python3 -m pip install virtualenv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "On each client, manually run the **2. Format and mount disks** section under [System Setup](#System-Setup)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Configure Spark, Hadoop"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Download packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n",
    "# backup url: !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz > /dev/null 2>&1\n",
    "if is_arm:\n",
    "    # download both versions\n",
    "    !wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Create directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "cmd=\";\".join([f\"chown -R {user}:{user} \" + l for l in datadir])\n",
    "for l in hclients:\n",
    "    !ssh root@{l} '{cmd}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "cmd=\";\".join([f\"rm -rf {l}/tmp; mkdir -p {l}/tmp\" for l in datadir])\n",
    "for l in hclients:\n",
    "    !ssh {l} '{cmd}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "cmd=\";\".join([f\"mkdir -p {l}/{user}/hdfs/data; mkdir -p {l}/{user}/yarn/local\" for l in datadir])\n",
    "for l in hclients:\n",
    "    !ssh {l} '{cmd}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p {datadir[0]}/{user}/hdfs/name\n",
    "!mkdir -p {datadir[0]}/{user}/hdfs/namesecondary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !scp hadoop-3.2.4.tar.gz {l}:~/\n",
    "    !scp spark-3.3.1-bin-hadoop3.tgz {l}:~/\n",
    "    !ssh {l} \"mv -f hadoop hadoop.bak; mv -f spark spark.bak\"\n",
    "    !ssh {l} \"tar zxvf hadoop-3.2.4.tar.gz > /dev/null 2>&1\"\n",
    "    !ssh {l} \"tar -zxvf spark-3.3.1-bin-hadoop3.tgz > /dev/null 2>&1\"\n",
    "    !ssh root@{l} \"apt install -y openjdk-8-jdk > /dev/null 2>&1\"\n",
    "    !ssh {l} \"ln -s hadoop-3.2.4 hadoop; ln -s spark-3.3.1-bin-hadoop3 spark\"\n",
    "    if is_arm:\n",
    "        # The aarch64 tarball was downloaded on the master only; copy it over before extracting.\n",
    "        !scp hadoop-3.3.5-aarch64.tar.gz {l}:~/\n",
    "        !ssh {l} \"tar zxvf hadoop-3.3.5-aarch64.tar.gz > /dev/null 2>&1\"\n",
    "        !ssh {l} \"cd hadoop && mv lib lib.bak && cp -rf ~/hadoop-3.3.5/lib ~/hadoop\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Configure bashrc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "\n",
    "cfg=f'''export HADOOP_HOME=~/hadoop\n",
    "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin\n",
    "\n",
    "export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\n",
    "export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop\n",
    "\n",
    "export SPARK_HOME=~/spark\n",
    "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH\n",
    "export PATH=$SPARK_HOME/bin:$PATH\n",
    "\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "if is_arm:\n",
    "    cfg += 'export CPU_TARGET=\"aarch64\"\\nexport JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'\n",
    "else:\n",
    "    cfg += f'export JAVA_HOME={java_home}\\nexport PATH=$JAVA_HOME/bin:$PATH\\n'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "with open(\"tmpcfg\",'w') as f:\n",
    "    f.write(cfg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !scp tmpcfg {l}:~/tmpcfg.in\n",
    "    !ssh {l} \"cat ~/tmpcfg.in >> ~/.bashrc\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} tail -n10 ~/.bashrc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Configure Hadoop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh root@{l} \"apt install -y libiberty-dev libxml2-dev libkrb5-dev libgsasl7-dev libuuid1 uuid-dev > /dev/null 2>&1\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Set up short-circuit local reads"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh root@{l} \"mkdir -p /var/lib/hadoop-hdfs/\"\n",
    "    !ssh root@{l} 'chown {user}:{user} /var/lib/hadoop-hdfs/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Enable `hadoop.security.authorization`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "coresite='''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
    "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
    "<!--\n",
    "  Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "  you may not use this file except in compliance with the License.\n",
    "  You may obtain a copy of the License at\n",
    "\n",
    "    http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "  Unless required by applicable law or agreed to in writing, software\n",
    "  distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "  See the License for the specific language governing permissions and\n",
    "  limitations under the License. See accompanying LICENSE file.\n",
    "-->\n",
    "\n",
    "<!-- Put site-specific property overrides in this file. -->\n",
    "\n",
    "<configuration>\n",
    "  <property>\n",
    "      <name>fs.default.name</name>\n",
    "      <value>hdfs://{:s}:8020</value>\n",
    "      <final>true</final>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>hadoop.security.authentication</name>\n",
    "      <value>simple</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>hadoop.security.authorization</name>\n",
    "      <value>true</value>\n",
    "  </property>\n",
    "</configuration>\n",
    "'''.format(hostname)\n",
    "\n",
    "with open(f'/home/{user}/hadoop/etc/hadoop/core-site.xml','w') as f:\n",
    "    f.writelines(coresite)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/hadoop/etc/hadoop/core-site.xml {l}:~/hadoop/etc/hadoop/core-site.xml >/dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### set IP check, note the command <font color=red>\", \"</font>.join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "hadooppolicy='''<?xml version=\"1.0\"?>\n",
    "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
    "<!--\n",
    "\n",
    " Licensed to the Apache Software Foundation (ASF) under one\n",
    " or more contributor license agreements.  See the NOTICE file\n",
    " distributed with this work for additional information\n",
    " regarding copyright ownership.  The ASF licenses this file\n",
    " to you under the Apache License, Version 2.0 (the\n",
    " \"License\"); you may not use this file except in compliance\n",
    " with the License.  You may obtain a copy of the License at\n",
    "\n",
    "     http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    " Unless required by applicable law or agreed to in writing, software\n",
    " distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    " See the License for the specific language governing permissions and\n",
    " limitations under the License.\n",
    "\n",
    "-->\n",
    "\n",
    "<!-- Put site-specific property overrides in this file. -->\n",
    "\n",
    "<configuration>\n",
    "  <property>\n",
    "    <name>security.service.authorization.default.hosts</name>\n",
    "    <value>{:s}</value>\n",
    "  </property>\n",
    "  <property>\n",
    "    <name>security.service.authorization.default.acl</name>\n",
    "    <value>{:s} {:s}</value>\n",
    "  </property>\n",
    "  \n",
    "  \n",
    "    <property>\n",
    "    <name>security.client.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ClientProtocol, which is used by user code\n",
    "    via the DistributedFileSystem.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.client.datanode.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ClientDatanodeProtocol, the client-to-datanode protocol\n",
    "    for block recovery.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.datanode.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for DatanodeProtocol, which is used by datanodes to\n",
    "    communicate with the namenode.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.inter.datanode.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for InterDatanodeProtocol, the inter-datanode protocol\n",
    "    for updating generation timestamp.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.namenode.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for NamenodeProtocol, the protocol used by the secondary\n",
    "    namenode to communicate with the namenode.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    " <property>\n",
    "    <name>security.admin.operations.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for AdminOperationsProtocol. Used for admin commands.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.refresh.user.mappings.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for RefreshUserMappingsProtocol. Used to refresh\n",
    "    users mappings. The ACL is a comma-separated list of user and\n",
    "    group names. The user and group list is separated by a blank. For\n",
    "    e.g. \"alice,bob users,wheel\".  A special value of \"*\" means all\n",
    "    users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.refresh.policy.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for RefreshAuthorizationPolicyProtocol, used by the\n",
    "    dfsadmin and mradmin commands to refresh the security policy in-effect.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.ha.service.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for HAService protocol used by HAAdmin to manage the\n",
    "      active and stand-by states of namenode.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.zkfc.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for access to the ZK Failover Controller\n",
    "    </description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.qjournal.service.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for QJournalProtocol, used by the NN to communicate with\n",
    "    JNs when using the QuorumJournalManager for edit logs.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.mrhs.client.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for HSClientProtocol, used by job clients to\n",
    "    communciate with the MR History Server job status etc.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <!-- YARN Protocols -->\n",
    "\n",
    "  <property>\n",
    "    <name>security.resourcetracker.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ResourceTrackerProtocol, used by the\n",
    "    ResourceManager and NodeManager to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.resourcemanager-administration.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ResourceManagerAdministrationProtocol, for admin commands.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.applicationclient.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ApplicationClientProtocol, used by the ResourceManager\n",
    "    and applications submission clients to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.applicationmaster.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ApplicationMasterProtocol, used by the ResourceManager\n",
    "    and ApplicationMasters to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.containermanagement.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ContainerManagementProtocol protocol, used by the NodeManager\n",
    "    and ApplicationMasters to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.resourcelocalizer.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ResourceLocalizer protocol, used by the NodeManager\n",
    "    and ResourceLocalizer to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.job.task.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for TaskUmbilicalProtocol, used by the map and reduce\n",
    "    tasks to communicate with the parent tasktracker.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.job.client.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for MRClientProtocol, used by job clients to\n",
    "    communciate with the MR ApplicationMaster to query job status etc.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>security.applicationhistory.protocol.acl</name>\n",
    "    <value>*</value>\n",
    "    <description>ACL for ApplicationHistoryProtocol, used by the timeline\n",
    "    server and the generic history service client to communicate with each other.\n",
    "    The ACL is a comma-separated list of user and group names. The user and\n",
    "    group list is separated by a blank. For e.g. \"alice,bob users,wheel\".\n",
    "    A special value of \"*\" means all users are allowed.</description>\n",
    "  </property>\n",
    "\n",
    "  \n",
    "  \n",
    "  \n",
    "  \n",
    "</configuration>\n",
    "'''.format((\",\").join(hclients),user,user)\n",
    "\n",
    "with open(f'/home/{user}/hadoop/etc/hadoop/hadoop-policy.xml','w') as f:\n",
    "    f.writelines(hadooppolicy)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/hadoop/etc/hadoop/hadoop-policy.xml {l}:~/hadoop/etc/hadoop/hadoop-policy.xml >/dev/null 2>&1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### hdfs config, set replication to 1 to cache all the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "code_folding": [],
    "hidden": true
   },
   "outputs": [],
   "source": [
    "hdfs_data=\",\".join([f'{l}/{user}/hdfs/data' for l in datadir])\n",
    "\n",
    "hdfs_site=f'''<?xml version=\"1.0\"?>\n",
    "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
    "\n",
    "<!-- Put site-specific property overrides in this file. -->\n",
    "\n",
    "<configuration>\n",
    "  <property>\n",
    "    <name>dfs.namenode.secondary.http-address</name>\n",
    "    <value>{hostname}:50090</value>\n",
    "  </property>\n",
    "  <property>\n",
    "    <name>dfs.namenode.name.dir</name>\n",
    "    <value>{datadir[0]}/{user}/hdfs/name</value>\n",
    "    <final>true</final>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>dfs.datanode.data.dir</name>\n",
    "        <value>{hdfs_data}</value>\n",
    "    <final>true</final>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "    <name>dfs.namenode.checkpoint.dir</name>\n",
    "    <value>{datadir[0]}/{user}/hdfs/namesecondary</value>\n",
    "    <final>true</final>\n",
    "  </property>\n",
    "  <property>\n",
    "    <name>dfs.name.handler.count</name>\n",
    "    <value>100</value>\n",
    "  </property>\n",
    "  <property>\n",
    "    <name>dfs.blocksize</name>\n",
    "    <value>128m</value>\n",
    "</property>\n",
    "  <property>\n",
    "    <name>dfs.replication</name>\n",
    "    <value>1</value>\n",
    "</property>\n",
    "\n",
    "<property>\n",
    "   <name>dfs.client.read.shortcircuit</name>\n",
    "   <value>true</value>\n",
    "</property>\n",
    "\n",
    "<property>\n",
    "   <name>dfs.domain.socket.path</name>\n",
    "   <value>/var/lib/hadoop-hdfs/dn_socket</value>\n",
    "</property>\n",
    "\n",
    "</configuration>\n",
    "'''\n",
    "\n",
    "\n",
    "with open(f'/home/{user}/hadoop/etc/hadoop/hdfs-site.xml','w') as f:\n",
    "    f.writelines(hdfs_site)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/hadoop/etc/hadoop/hdfs-site.xml {l}:~/hadoop/etc/hadoop/hdfs-site.xml >/dev/null 2>&1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### mapreduce config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "mapreduce='''<?xml version=\"1.0\"?>\n",
    "<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n",
    "<!--\n",
    "  Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "  you may not use this file except in compliance with the License.\n",
    "  You may obtain a copy of the License at\n",
    "\n",
    "    http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "  Unless required by applicable law or agreed to in writing, software\n",
    "  distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "  See the License for the specific language governing permissions and\n",
    "  limitations under the License. See accompanying LICENSE file.\n",
    "-->\n",
    "\n",
    "<!-- Put site-specific property overrides in this file. -->\n",
    "\n",
    "<configuration>\n",
    "    <property>\n",
    "        <name>mapreduce.framework.name</name>\n",
    "        <value>yarn</value>\n",
    "    </property>\n",
    "\n",
    "     <property>\n",
    "         <name>mapreduce.job.maps</name>\n",
    "         <value>288</value>\n",
    "     </property>\n",
    "     <property>\n",
    "         <name>mapreduce.job.reduces</name>\n",
    "         <value>64</value>\n",
    "     </property>\n",
    "\n",
    "     <property>\n",
    "         <name>mapreduce.map.java.opts</name>\n",
    "         <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n",
    "     </property>\n",
    "      <property>\n",
    "         <name>mapreduce.map.memory.mb</name>\n",
    "         <value>6144</value>\n",
    "         </property>\n",
    "\n",
    "     <property>\n",
    "         <name>mapreduce.reduce.java.opts</name>\n",
    "         <value>-Xmx5120M -DpreferIPv4Stack=true</value>\n",
    "     </property>\n",
    "     <property>\n",
    "         <name>mapreduce.reduce.memory.mb</name>\n",
    "         <value>6144</value>\n",
    "     </property>\n",
    "     <property>\n",
    "         <name>yarn.app.mapreduce.am.staging-dir</name>\n",
    "         <value>/user</value>\n",
    "     </property>\n",
    "     <property>\n",
    "         <name>mapreduce.task.io.sort.mb</name>\n",
    "         <value>2000</value>\n",
    "     </property>\n",
    "     <property>\n",
    "         <name>mapreduce.task.timeout</name>\n",
    "         <value>3600000</value>\n",
    "     </property>\n",
    "<!-- MapReduce Job History Server security configs -->\n",
    "<property>\n",
    "  <name>mapreduce.jobhistory.address</name>\n",
    "  <value>{:s}:10020</value>\n",
    "</property>\n",
    "\n",
    "</configuration>\n",
    "'''.format(hostname)\n",
    "\n",
    "\n",
    "with open(f'/home/{user}/hadoop/etc/hadoop/mapred-site.xml','w') as f:\n",
    "    f.writelines(mapreduce)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/hadoop/etc/hadoop/mapred-site.xml {l}:~/hadoop/etc/hadoop/mapred-site.xml >/dev/null 2>&1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### yarn config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "code_folding": [],
    "hidden": true
   },
   "outputs": [],
   "source": [
    "yarn_site=f'''<?xml version=\"1.0\"?>\n",
    "<!--\n",
    "  Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "  you may not use this file except in compliance with the License.\n",
    "  You may obtain a copy of the License at\n",
    "\n",
    "    http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "  Unless required by applicable law or agreed to in writing, software\n",
    "  distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "  See the License for the specific language governing permissions and\n",
    "  limitations under the License. See accompanying LICENSE file.\n",
    "-->\n",
    "<configuration>\n",
    "  <property>\n",
    "      <name>yarn.resourcemanager.hostname</name>\n",
    "      <value>{hostname}</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.resourcemanager.address</name>\n",
    "      <value>{hostname}:8032</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.resourcemanager.webapp.address</name>\n",
    "      <value>{hostname}:8088</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.resource.memory-mb</name>\n",
    "      <value>{mem_mib}</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.resource.cpu-vcores</name>\n",
    "      <value>{cpu_num}</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.pmem-check-enabled</name>\n",
    "      <value>false</value>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.vmem-check-enabled</name>\n",
    "      <value>false</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.vmem-pmem-ratio</name>\n",
    "      <value>4.1</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.aux-services</name>\n",
    "      <value>mapreduce_shuffle,spark_shuffle</value>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "      <name>yarn.scheduler.minimum-allocation-mb</name>\n",
    "      <value>1024</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.scheduler.maximum-allocation-mb</name>\n",
    "      <value>{mem_mib}</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.scheduler.minimum-allocation-vcores</name>\n",
    "      <value>1</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.scheduler.maximum-allocation-vcores</name>\n",
    "      <value>{cpu_num}</value>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "      <name>yarn.log-aggregation-enable</name>\n",
    "      <value>false</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.log.retain-seconds</name>\n",
    "      <value>36000</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.delete.debug-delay-sec</name>\n",
    "      <value>3600</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.log.server.url</name>\n",
    "      <value>http://{hostname}:19888/jobhistory/logs/</value>\n",
    "  </property>\n",
    "\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.log-dirs</name>\n",
    "      <value>/home/{user}/hadoop/logs/userlogs</value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.local-dirs</name>\n",
    "      <value>{sparklocals}\n",
    "      </value>\n",
    "  </property>\n",
    "  <property>\n",
    "      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>\n",
    "      <value>org.apache.spark.network.yarn.YarnShuffleService</value>\n",
    "  </property>\n",
    "</configuration>\n",
    "'''\n",
    "\n",
    "\n",
    "with open(f'/home/{user}/hadoop/etc/hadoop/yarn-site.xml','w') as f:\n",
    "    f.writelines(yarn_site)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/hadoop/etc/hadoop/yarn-site.xml {l}:~/hadoop/etc/hadoop/yarn-site.xml >/dev/null 2>&1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### hadoop-env"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "#config java home\n",
    "if is_arm:\n",
    "    !echo \"export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n",
    "else:\n",
    "    !echo \"export JAVA_HOME={java_home}\" >> ~/hadoop/etc/hadoop/hadoop-env.sh\n",
    "\n",
    "for l in clients:\n",
    "    !scp hadoop/etc/hadoop/hadoop-env.sh {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### workers config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "if clients:\n",
    "    with open(f'/home/{user}/hadoop/etc/hadoop/workers','w') as f:\n",
    "        f.writelines(\"\\n\".join(clients))\n",
    "    for l in clients:\n",
    "        !scp hadoop/etc/hadoop/workers {l}:~/hadoop/etc/hadoop/ >/dev/null 2>&1\n",
    "else:\n",
    "    !echo {hostname} > ~/hadoop/etc/hadoop/workers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Copy jar from Spark for external shuffle service"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !scp spark/yarn/spark-3.3.1-yarn-shuffle.jar {l}:~/hadoop/share/hadoop/common/lib/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Configure Spark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "eventlog_dir=f'hdfs://{hostname}:8020/tmp/sparkEventLog'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "sparkconf=f'''\n",
    "spark.eventLog.enabled    true\n",
    "spark.eventLog.dir        {eventlog_dir}\n",
    "spark.history.fs.logDirectory    {eventlog_dir}\n",
    "'''\n",
    "\n",
    "with open(f'/home/{user}/spark/conf/spark-defaults.conf','w+') as f:\n",
    "    f.writelines(sparkconf)\n",
    "    \n",
    "for l in clients:\n",
    "    !scp ~/spark/conf/spark-defaults.conf {l}:~/spark/conf/spark-defaults.conf >/dev/null 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "sparkenv = f'export SPARK_LOCAL_DIRS={sparklocals}\\n'\n",
    "with open(f'/home/{user}/.bashrc', 'a+') as f:\n",
    "    f.writelines(sparkenv)\n",
    "for l in clients:\n",
    "    !scp ~/.bashrc {l}:~/.bashrc >/dev/null 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} tail -n10 ~/.bashrc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Configure monitor & startups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!cd ~\n",
    "!git clone https://github.com/trailofbits/tsc_freq_khz.git\n",
    "\n",
    "for l in clients:\n",
    "    !scp -r tsc_freq_khz {l}:~/\n",
    "\n",
    "for l in hclients:\n",
    "    !ssh {l} 'cd tsc_freq_khz && make && sudo insmod ./tsc_freq_khz.ko' >/dev/null 2>&1\n",
    "    !ssh root@{l} 'dmesg | grep tsc_freq_khz'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "startup=f'''#!/bin/bash\n",
    "echo -1 > /proc/sys/kernel/perf_event_paranoid\n",
    "echo 0 > /proc/sys/kernel/kptr_restrict\n",
    "echo madvise >/sys/kernel/mm/transparent_hugepage/enabled\n",
    "echo 1 > /proc/sys/kernel/numa_balancing\n",
    "end=$(($(nproc) - 1))\n",
    "for i in $(seq 0 $end); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done\n",
    "for file in $(find /sys/devices/system/cpu/cpu*/power/energy_perf_bias); do echo \"0\" > $file; done\n",
    "\n",
    "if [ -d /home/{user}/sep_installed ]; then\n",
    "    /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\n",
    "fi\n",
    "'''\n",
    "\n",
    "with open('/tmp/tmpstartup', 'w') as f:\n",
    "    f.writelines(startup)\n",
    "\n",
    "startup_service=f'''[Unit]\n",
    "Description=Configure Transparent Hugepage, Auto NUMA Balancing, CPU Freq Scaling Governor\n",
    "\n",
    "[Service]\n",
    "ExecStart=/usr/local/bin/mystartup.sh\n",
    "\n",
    "[Install]\n",
    "WantedBy=multi-user.target\n",
    "'''\n",
    "\n",
    "with open('/tmp/tmpstartup_service', 'w') as f:\n",
    "    f.writelines(startup_service)\n",
    "    \n",
    "for l in hclients:\n",
    "    !scp /tmp/tmpstartup $l:/tmp/tmpstartup\n",
    "    !scp /tmp/tmpstartup_service $l:/tmp/tmpstartup_service\n",
    "    !ssh root@$l \"cat /tmp/tmpstartup > /usr/local/bin/mystartup.sh\"\n",
    "    !ssh root@$l \"chmod +x /usr/local/bin/mystartup.sh\"\n",
    "    !ssh root@$l \"cat /tmp/tmpstartup_service > /etc/systemd/system/mystartup.service\"\n",
    "    !ssh $l \"sudo systemctl enable mystartup.service\"\n",
    "    !ssh $l \"sudo systemctl start mystartup.service\"\n",
    "    !ssh $l \"sudo systemctl status mystartup.service\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Install Emon"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "<font color=red size=3> Get the latest offline installer from [link](https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html?operatingsystem=linux&linux-install-type=offline) </font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "offline_installer = 'https://registrationcenter-download.intel.com/akdlm/IRC_NAS/e7797b12-ce87-4df0-aa09-df4a272fc5d9/intel-vtune-2025.0.0.1130_offline.sh'\n",
    "for l in hclients:\n",
    "    !ssh {l} \"wget {offline_installer} -q && chmod +x intel-vtune-2025.0.0.1130_offline.sh\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} \"sudo ./intel-vtune-2025.0.0.1130_offline.sh -a -c -s --eula accept\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} \"sudo chown -R {user}:{user} /opt/intel/oneapi/vtune/ && rm -f sep_installed && ln -s /opt/intel/oneapi/vtune/latest sep_installed\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} \"cd sep_installed/sepdk/src/; echo -e \\\"\\\\n\\\\n\\\\n\\\" | ./build-driver\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh root@{l} \"/home/{user}/sep_installed/sepdk/src/rmmod-sep && /home/{user}/sep_installed/sepdk/src/insmod-sep -g {user}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1; emon -v | head -n 1\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} 'echo \"source /home/{user}/sep_installed/sep_vars.sh > /dev/null 2>&1\" >> ~/.bashrc'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for c in hclients:\n",
    "    !ssh {c} 'tail -n1 ~/.bashrc'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "heading_collapsed": true,
    "hidden": true,
    "run_control": {
     "frozen": true
    }
   },
   "source": [
    "## Inspect CPU Freq & HT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "hidden": true,
    "run_control": {
     "frozen": true
    }
   },
   "outputs": [],
   "source": [
    "if is_arm:\n",
    "    t = r'''\n",
    "    #include <stdlib.h>\n",
    "    #include <stdio.h>\n",
    "    #include <assert.h>\n",
    "    #include <sys/time.h>\n",
    "    #include <sched.h>\n",
    "    #include <sys/shm.h>\n",
    "    #include <errno.h>\n",
    "    #include <sys/mman.h>\n",
    "    #include <unistd.h>                        //used for parsing the command line arguments\n",
    "    #include <fcntl.h>                        //used for opening the memory device file\n",
    "    #include <math.h>                        //used for rounding functions\n",
    "    #include <signal.h>\n",
    "    #include <linux/types.h>\n",
    "    #include <stdint.h>\n",
    "\n",
    "    static inline uint64_t GetTickCount()\n",
    "    {//Return ns counts\n",
    "        struct timeval tp;\n",
    "        gettimeofday(&tp,NULL);\n",
    "        return tp.tv_sec*1000+tp.tv_usec/1000;\n",
    "    }\n",
    "\n",
    "    uint64_t CNT=CNT_DEF;\n",
    "\n",
    "    int main()\n",
    "    {\n",
    "\n",
    "      uint64_t start, end;\n",
    "      start=end=GetTickCount();\n",
    "\n",
    "        asm  volatile (\n",
    "          \"1:\\n\"\n",
    "          \"SUBS %0,%0,#1\\n\"\n",
    "          \"bne 1b\\n\"\n",
    "          ::\"r\"(CNT)\n",
    "          );\n",
    "\n",
    "      end=GetTickCount();\n",
    "\n",
    "      printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n",
    "\n",
    "      return 0;\n",
    "    }\n",
    "    '''\n",
    "else:\n",
    "    t=r'''\n",
    "    #include <stdlib.h>\n",
    "    #include <stdio.h>\n",
    "    #include <assert.h>\n",
    "    #include <sys/time.h>\n",
    "    #include <sched.h>\n",
    "    #include <sys/shm.h>\n",
    "    #include <errno.h>\n",
    "    #include <sys/mman.h>\n",
    "    #include <unistd.h>                        //used for parsing the command line arguments\n",
    "    #include <fcntl.h>                        //used for opening the memory device file\n",
    "    #include <math.h>                        //used for rounding functions\n",
    "    #include <signal.h>\n",
    "    #include <linux/types.h>\n",
    "    #include <stdint.h>\n",
    "\n",
    "    static inline uint64_t GetTickCount()\n",
    "    {//Return ns counts\n",
    "        struct timeval tp;\n",
    "        gettimeofday(&tp,NULL);\n",
    "        return tp.tv_sec*1000+tp.tv_usec/1000;\n",
    "    }\n",
    "\n",
    "    uint64_t CNT=CNT_DEF;\n",
    "\n",
    "    int main()\n",
    "    {\n",
    "\n",
    "      uint64_t start, end;\n",
    "      start=end=GetTickCount();\n",
    "\n",
    "        asm  volatile (\n",
    "          \"1:\\n\"\n",
    "          \"dec %0\\n\"\n",
    "          \"jnz 1b\\n\"\n",
    "          ::\"r\"(CNT)\n",
    "          );\n",
    "\n",
    "      end=GetTickCount();\n",
    "\n",
    "      printf(\" total time = %lu, freq = %lu \\n\", end-start, CNT/(end-start)/1000);\n",
    "\n",
    "      return 0;\n",
    "    }\n",
    "    '''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "hidden": true,
    "run_control": {
     "frozen": true
    }
   },
   "outputs": [],
   "source": [
    "%cd ~\n",
    "with open(\"t.c\", 'w') as f:\n",
    "    f.writelines(t)\n",
    "!gcc -O3 -DCNT_DEF=10000000000LL -o t t.c; gcc -O3 -DCNT_DEF=1000000000000LL -o t.delay t.c;\n",
    "!for j in `seq 1 $(nproc)`; do echo -n $j; (for i in `seq 1 $j`; do taskset -c $i ./t.delay & done); sleep 1; ./t; killall t.delay; sleep 2; done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# <font color=red>Shutdown Jupyter; source ~/.bashrc; reboot Jupyter; run section [Initialize](#Initialize)</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Build gluten"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Install docker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Instructions from https://docs.docker.com/engine/install/ubuntu/\n",
    "\n",
    "# Add Docker's official GPG key:\n",
    "!sudo -E apt-get update\n",
    "!sudo -E apt-get install ca-certificates curl\n",
    "!sudo -E install -m 0755 -d /etc/apt/keyrings\n",
    "!sudo -E curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc\n",
    "!sudo chmod a+r /etc/apt/keyrings/docker.asc\n",
    "\n",
    "# Add the repository to Apt sources:\n",
    "!echo \\\n",
    "  \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \\\n",
    "  $(. /etc/os-release && echo \"$VERSION_CODENAME\") stable\" | \\\n",
    "  sudo -E tee /etc/apt/sources.list.d/docker.list > /dev/null\n",
    "!sudo -E apt-get update"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!sudo -E apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin >/dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Configure docker proxy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "http_proxy=os.getenv('http_proxy')\n",
    "https_proxy=os.getenv('https_proxy')\n",
    "\n",
    "if http_proxy or https_proxy:\n",
    "    !sudo mkdir -p /etc/systemd/system/docker.service.d\n",
    "    with open('/tmp/http-proxy.conf', 'w') as f:\n",
    "        s = '''\n",
    "[Service]\n",
    "{}\n",
    "{}\n",
    "'''.format(f'Environment=\"HTTP_PROXY={http_proxy}\"' if http_proxy else '', f'Environment=\"HTTPS_PROXY={https_proxy}\"' if https_proxy else '')\n",
    "        f.writelines(s)\n",
    "    !sudo cp /tmp/http-proxy.conf /etc/systemd/system/docker.service.d\n",
    "    \n",
    "    !ssh root@localhost \"mkdir -p /root/.docker\"\n",
    "    with open(f'/tmp/config.json', 'w') as f:\n",
    "        s = f'''\n",
    "{{\n",
    " \"proxies\": {{\n",
    "   \"default\": {{\n",
    "     \"httpProxy\": \"{http_proxy}\",\n",
    "     \"httpsProxy\": \"{https_proxy}\",\n",
    "     \"noProxy\": \"127.0.0.0/8\"\n",
    "   }}\n",
    " }}\n",
    "}}\n",
    "        '''\n",
    "        f.writelines(s)\n",
    "    !ssh root@localhost \"cp -f /tmp/config.json /root/.docker\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Configure maven proxy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p ~/.m2\n",
    "\n",
    "def get_proxy(proxy):\n",
    "    pos0 = proxy.rfind('/')\n",
    "    pos = proxy.rfind(':')\n",
    "    host = http_proxy[pos0+1:pos]\n",
    "    port = http_proxy[pos+1:]\n",
    "    return host, port\n",
    "\n",
    "if http_proxy or https_proxy:\n",
    "    with open(f\"/home/{user}/.m2/settings.xml\",\"w+\") as f:\n",
    "        f.write('''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
    "<settings xmlns=\"http://maven.apache.org/SETTINGS/1.2.0\"\n",
    "          xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n",
    "          xsi:schemaLocation=\"http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd\">\n",
    "    <proxies>''')\n",
    "        if http_proxy:\n",
    "            host, port = get_proxy(http_proxy)\n",
    "            f.write(f'''\n",
    "        <proxy>\n",
    "          <id>http_proxy</id>\n",
    "          <active>true</active>\n",
    "          <protocol>http</protocol>\n",
    "          <host>{host}</host>\n",
    "          <port>{port}</port>\n",
    "        </proxy>''')\n",
    "        if https_proxy:\n",
    "            host, port = get_proxy(http_proxy)\n",
    "            f.write(f'''\n",
    "        <proxy>\n",
    "          <id>https_proxy</id>\n",
    "          <active>true</active>\n",
    "          <protocol>https</protocol>\n",
    "          <host>{host}</host>\n",
    "          <port>{port}</port>\n",
    "        </proxy>''')\n",
    "        f.write('''\n",
    "    </proxies>\n",
    "</settings>\n",
    "''')"
   ]
  },
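  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "The `get_proxy` helper above finds the host and port by searching for the last `/` and `:` in the proxy URL. As a possible alternative, here is a minimal sketch that uses the standard library's `urllib.parse.urlsplit` instead. The name `split_proxy` and the sample URL are hypothetical, and the sketch assumes the proxy URL carries a scheme such as `http://`.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlsplit\n",
    "\n",
    "def split_proxy(proxy_url):\n",
    "    # Hypothetical helper: return (host, port) as strings for a proxy URL.\n",
    "    parts = urlsplit(proxy_url)\n",
    "    # parts.port is None when the URL has no explicit port.\n",
    "    port = str(parts.port) if parts.port is not None else ''\n",
    "    return parts.hostname, port\n",
    "\n",
    "print(split_proxy('http://proxy.example.com:3128'))\n",
    "```\n",
    "\n",
    "Unlike the string search, this also copes with credentials or a trailing slash in the URL, for example `http://user:pw@proxy.example.com:3128/`."
   ]
  },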
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!sudo systemctl daemon-reload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!sudo systemctl restart docker.service"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Build gluten"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "%cd ~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "if not os.path.exists('gluten'):\n",
    "    !git clone https://github.com/apache/incubator-gluten.git gluten"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Build Arrow for the first time build.\n",
    "!sed -i 's/--build_arrow=OFF/--build_arrow=ON/' ~/gluten/dev/package-vcpkg.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "%cd ~/gluten/tools/workload/benchmark_velox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!bash build_gluten.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !scp ~/gluten/package/target/gluten-velox-bundle-spark*.jar {l}:~/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Generate data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Build spark-sql-perf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!echo \"deb https://repo.scala-sbt.org/scalasbt/debian all main\" | sudo tee /etc/apt/sources.list.d/sbt.list\n",
    "!echo \"deb https://repo.scala-sbt.org/scalasbt/debian /\" | sudo tee /etc/apt/sources.list.d/sbt_old.list\n",
    "!curl -sL \"https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823\" | sudo apt-key add\n",
    "!sudo -E apt-get update > /dev/null 2>&1\n",
    "!sudo -E apt-get install sbt > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "http_proxy=os.getenv('http_proxy')\n",
    "https_proxy=os.getenv('https_proxy')\n",
    "\n",
    "def get_proxy(proxy):\n",
    "    pos0 = proxy.rfind('/')\n",
    "    pos = proxy.rfind(':')\n",
    "    host = http_proxy[pos0+1:pos]\n",
    "    port = http_proxy[pos+1:]\n",
    "    return host, port\n",
    "\n",
    "sbt_opts=''\n",
    "\n",
    "if http_proxy:\n",
    "    host, port = get_proxy(http_proxy)\n",
    "    sbt_opts = f'{sbt_opts} -Dhttp.proxyHost={host} -Dhttp.proxyPort={port}'\n",
    "if https_proxy:\n",
    "    host, port = get_proxy(https_proxy)\n",
    "    sbt_opts = f'{sbt_opts} -Dhttps.proxyHost={host} -Dhttps.proxyPort={port}'\n",
    "    \n",
    "if sbt_opts:\n",
    "    %env SBT_OPTS={sbt_opts}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/databricks/spark-sql-perf.git ~/spark-sql-perf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!cd ~/spark-sql-perf && sbt package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!cp ~/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar ~/ipython/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## Start Hadoop/Spark cluster, Spark history server"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!~/hadoop/bin/hadoop namenode -format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!~/hadoop/bin/hadoop datanode -format "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!~/hadoop/sbin/start-dfs.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!hadoop dfsadmin -safemode leave"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!hadoop fs -mkdir -p /tmp/sparkEventLog"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!cd ~/spark && sbin/start-history-server.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "master=''\n",
    "if clients:\n",
    "    !~/hadoop/sbin/start-yarn.sh\n",
    "    master='yarn'\n",
    "else:\n",
    "    # If we run on single node, we use standalone mode\n",
    "    !{os.environ['SPARK_HOME']}/sbin/stop-slave.sh\n",
    "    !{os.environ['SPARK_HOME']}/sbin/stop-master.sh\n",
    "    !{os.environ['SPARK_HOME']}/sbin/start-master.sh\n",
    "    !{os.environ['SPARK_HOME']}/sbin/start-worker.sh spark://{hostname}:7077 -c {cpu_num}\n",
    "    master=f'spark://{hostname}:7077'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!jps\n",
    "for l in clients:\n",
    "    !ssh {l} jps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## TPCH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!rm -rf ~/tpch-dbgen\n",
    "!git clone https://github.com/databricks/tpch-dbgen.git ~/tpch-dbgen"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !scp -r ~/tpch-dbgen {l}:~/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} cd ~/tpch-dbgen && git checkout 0469309147b42abac8857fa61b4cf69a6d3128a8 && make clean && make OS=LINUX"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "%cd  ~/gluten/tools/workload/tpch/gen_data/parquet_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Suggest 2x cpu# partitions.\n",
    "scaleFactor = 1500\n",
    "numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n",
    "dataformat = \"parquet\" # data format of data source\n",
    "dataSourceCodec = \"snappy\"\n",
    "rootDir = f\"/tpch_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in."
   ]
  },
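  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "As a worked example of the partition rule above, the sketch below evaluates it for a single node and for a small cluster. The function name `suggested_partitions` and the sample values are hypothetical; `cpu_num` and `clients` stand in for the notebook variables of the same names.\n",
    "\n",
    "```python\n",
    "def suggested_partitions(cpu_num, clients):\n",
    "    # Two partitions per core, multiplied by the number of worker nodes if any are listed.\n",
    "    return 2 * cpu_num if len(clients) == 0 else len(clients) * 2 * cpu_num\n",
    "\n",
    "# Single node with 32 cores, then three workers with 32 cores each.\n",
    "print(suggested_partitions(32, []))\n",
    "print(suggested_partitions(32, ['worker1', 'worker2', 'worker3']))\n",
    "```"
   ]
  },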
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Verify parameters\n",
    "print(f'scaleFactor = {scaleFactor}')\n",
    "print(f'numPartitions = {numPartitions}')\n",
    "print(f'dataformat = {dataformat}')\n",
    "print(f'rootDir = {rootDir}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "scala=f'''import com.databricks.spark.sql.perf.tpch._\n",
    "\n",
    "\n",
    "val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n",
    "val numPartitions = {numPartitions}  // how many dsdgen partitions to run - number of input tasks.\n",
    "\n",
    "val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n",
    "val rootDir = \"{rootDir}\" // root directory of location to create data in.\n",
    "val dbgenDir = \"/home/{user}/tpch-dbgen\" // location of dbgen\n",
    "\n",
    "val tables = new TPCHTables(spark.sqlContext,\n",
    "    dbgenDir = dbgenDir,\n",
    "    scaleFactor = scaleFactor,\n",
    "    useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n",
    "    useStringForDate = false) // true to replace DateType with StringType\n",
    "\n",
    "\n",
    "tables.genData(\n",
    "    location = rootDir,\n",
    "    format = format,\n",
    "    overwrite = true, // overwrite the data that is already there\n",
    "    partitionTables = false, // do not create the partitioned fact tables\n",
    "    clusterByPartitionColumns = false, // shuffle to get partitions coalesced into single files.\n",
    "    filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n",
    "    tableFilter = \"\", // \"\" means generate all tables\n",
    "    numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n",
    "'''\n",
    "\n",
    "with open(\"tpch_datagen_parquet.scala\",\"w\") as f:\n",
    "    f.writelines(scala)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "executor_cores = 8\n",
    "num_executors=cpu_num/executor_cores\n",
    "executor_memory = (totalmemory - 10*1024*1024)/num_executors - 1*1024*1024\n",
    "\n",
    "# Verify parameters\n",
    "print(f'--master {master}')\n",
    "print(f'--num-executors {int(num_executors)}')\n",
    "print(f'--executor-cores {int(executor_cores)}')\n",
    "print(f'--executor-memory {int(executor_memory)}k')\n",
    "print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n",
    "print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')"
   ]
  },
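  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "The sizing above appears to set aside 10 GiB for the driver (in line with `--driver-memory 10g` in the launch script) and 1 GiB per executor for `spark.executor.memoryOverhead`. The sketch below repeats the same arithmetic for a hypothetical node with 64 cores and 256 GiB of memory. It assumes `totalmemory` is expressed in KiB, which is what the `k` suffix on `--executor-memory` suggests; the function name `size_executors` is illustrative only.\n",
    "\n",
    "```python\n",
    "def size_executors(cpu_num, totalmemory_kib, executor_cores=8):\n",
    "    # Same arithmetic as the cell above; all memory values are in KiB.\n",
    "    num_executors = cpu_num / executor_cores\n",
    "    executor_memory = (totalmemory_kib - 10 * 1024 * 1024) / num_executors - 1 * 1024 * 1024\n",
    "    return int(num_executors), int(executor_memory)\n",
    "\n",
    "# Hypothetical node: 64 cores and 256 GiB of memory (expressed in KiB).\n",
    "print(size_executors(64, 256 * 1024 * 1024))\n",
    "```\n",
    "\n",
    "Note that `num_executors` is only truncated to an integer when it is printed or passed to `spark-shell`, so a core count that is not a multiple of `executor_cores` leaves some memory unassigned."
   ]
  },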
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "tpch_datagen_parquet=f'''\n",
    "cat tpch_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n",
    "  --master {master} \\\n",
    "  --name tpch_gen_parquet \\\n",
    "  --driver-memory 10g \\\n",
    "  --num-executors {int(num_executors)} \\\n",
    "  --executor-cores {int(executor_cores)} \\\n",
    "  --executor-memory {int(executor_memory)}k \\\n",
    "  --conf spark.executor.memoryOverhead=1g \\\n",
    "  --conf spark.sql.broadcastTimeout=4800 \\\n",
    "  --conf spark.driver.maxResultSize=4g \\\n",
    "  --conf spark.sql.shuffle.partitions={numPartitions} \\\n",
    "  --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n",
    "  --conf spark.network.timeout=800s \\\n",
    "  --conf spark.executor.heartbeatInterval=200s \\\n",
    "  --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n",
    "'''\n",
    "\n",
    "with open(\"tpch_datagen_parquet.sh\",\"w\") as f:\n",
    "    f.writelines(tpch_datagen_parquet)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!nohup bash tpch_datagen_parquet.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "## TPCDS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!rm -rf ~/tpcds-kit\n",
    "!git clone https://github.com/databricks/tpcds-kit.git ~/tpcds-kit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !scp -r ~/tpcds-kit {l}:~/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in hclients:\n",
    "    !ssh {l} \"cd ~/tpcds-kit/tools && make clean && make OS=LINUX CC=gcc-9\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "%cd ~/gluten/tools/workload/tpcds/gen_data/parquet_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Suggest 2x cpu# partitions\n",
    "scaleFactor = 1500\n",
    "numPartitions = 2*cpu_num if len(clients)==0 else len(clients)*2*cpu_num\n",
    "dataformat = \"parquet\" # data format of data source\n",
    "dataSourceCodec = \"snappy\"\n",
    "rootDir = f\"/tpcds_sf{scaleFactor}_{dataformat}_{dataSourceCodec}\" # root directory of location to create data in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Verify parameters\n",
    "print(f'scaleFactor = {scaleFactor}')\n",
    "print(f'numPartitions = {numPartitions}')\n",
    "print(f'dataformat = {dataformat}')\n",
    "print(f'rootDir = {rootDir}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "scala=f'''import com.databricks.spark.sql.perf.tpcds._\n",
    "\n",
    "val scaleFactor = \"{scaleFactor}\" // scaleFactor defines the size of the dataset to generate (in GB).\n",
    "val numPartitions = {numPartitions}  // how many dsdgen partitions to run - number of input tasks.\n",
    "\n",
    "val format = \"{dataformat}\" // valid spark format like parquet \"parquet\".\n",
    "val rootDir = \"{rootDir}\" // root directory of location to create data in.\n",
    "val dsdgenDir = \"/home/{user}/tpcds-kit/tools/\" // location of dbgen\n",
    "\n",
    "val tables = new TPCDSTables(spark.sqlContext,\n",
    "    dsdgenDir = dsdgenDir,\n",
    "    scaleFactor = scaleFactor,\n",
    "    useDoubleForDecimal = false, // true to replace DecimalType with DoubleType\n",
    "    useStringForDate = false) // true to replace DateType with StringType\n",
    "\n",
    "\n",
    "tables.genData(\n",
    "    location = rootDir,\n",
    "    format = format,\n",
    "    overwrite = true, // overwrite the data that is already there\n",
    "    partitionTables = true, // create the partitioned fact tables\n",
    "    clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.\n",
    "    filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value\n",
    "    tableFilter = \"\", // \"\" means generate all tables\n",
    "    numPartitions = numPartitions) // how many dsdgen partitions to run - number of input tasks.\n",
    "'''\n",
    "\n",
    "with open(\"tpcds_datagen_parquet.scala\",\"w\") as f:\n",
    "    f.writelines(scala)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "executor_cores = 8\n",
    "num_executors=cpu_num/executor_cores\n",
    "executor_memory = (totalmemory - 10*1024*1024)/num_executors - 1*1024*1024\n",
    "\n",
    "# Verify parameters\n",
    "print(f'--master {master}')\n",
    "print(f'--num-executors {int(num_executors)}')\n",
    "print(f'--executor-cores {int(executor_cores)}')\n",
    "print(f'--executor-memory {int(executor_memory)}k')\n",
    "print(f'--conf spark.sql.shuffle.partitions={numPartitions}')\n",
    "print(f'--conf spark.sql.parquet.compression.codec={dataSourceCodec}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "tpcds_datagen_parquet=f'''\n",
    "cat tpcds_datagen_parquet.scala | {os.environ['SPARK_HOME']}/bin/spark-shell \\\n",
    "  --master {master} \\\n",
    "  --name tpcds_gen_parquet \\\n",
    "  --driver-memory 10g \\\n",
    "  --num-executors {int(num_executors)} \\\n",
    "  --executor-cores {int(executor_cores)} \\\n",
    "  --executor-memory {int(executor_memory)}k \\\n",
    "  --conf spark.executor.memoryOverhead=1g \\\n",
    "  --conf spark.sql.broadcastTimeout=4800 \\\n",
    "  --conf spark.driver.maxResultSize=4g \\\n",
    "  --conf spark.sql.shuffle.partitions={numPartitions} \\\n",
    "  --conf spark.sql.parquet.compression.codec={dataSourceCodec} \\\n",
    "  --conf spark.network.timeout=800s \\\n",
    "  --conf spark.executor.heartbeatInterval=200s \\\n",
    "  --jars /home/{user}/ipython/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \\\n",
    "'''\n",
    "\n",
    "with open(\"tpcds_datagen_parquet.sh\",\"w\") as f:\n",
    "    f.writelines(tpcds_datagen_parquet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!nohup bash tpcds_datagen_parquet.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Set up perf analysis tools (optional)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We have a set of perf analysis scripts under $GLUTEN_HOME/tools/workload/benchmark_velox/analysis. You can follow below steps to deploy the scripts on the same cluster and use them for performance analysis after each run."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "source": [
    "## Install and deploy Trace-Viewer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Clone the master branch of project catapult:\n",
    "```\n",
    "cd ~\n",
    "git clone https://github.com/catapult-project/catapult.git -b master\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Trace-Viewer requires python version 2.7. Create a virtualenv for python2.7:\n",
    "```\n",
    "sudo apt install -y python2.7\n",
    "virtualenv -p /usr/bin/python2.7 py27-env\n",
    "source py27-env/bin/activate\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Apply patch:\n",
    "\n",
    "```\n",
    "cd catapult\n",
    "```\n",
    "```\n",
    "git apply <<EOF\n",
    "diff --git a/catapult_build/dev_server.py b/catapult_build/dev_server.py\n",
    "index a8b25c485..ec9638c7b 100644\n",
    "--- a/catapult_build/dev_server.py\n",
    "+++ b/catapult_build/dev_server.py\n",
    "@@ -328,11 +328,11 @@ def Main(argv):\n",
    "\n",
    "   app = DevServerApp(pds, args=args)\n",
    "\n",
    "-  server = httpserver.serve(app, host='127.0.0.1', port=args.port,\n",
    "+  server = httpserver.serve(app, host='0.0.0.0', port=args.port,\n",
    "                             start_loop=False, daemon_threads=True)\n",
    "   _AddPleaseExitMixinToServer(server)\n",
    "   # pylint: disable=no-member\n",
    "-  server.urlbase = 'http://127.0.0.1:%i' % server.server_port\n",
    "+  server.urlbase = 'http://0.0.0.0:%i' % server.server_port\n",
    "   app.server = server\n",
    "\n",
    "   sys.stderr.write('Now running on %s\\n' % server.urlbase)\n",
    "EOF\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Start the service:\n",
    "\n",
    "```\n",
    "mkdir -p ~/trace_result\n",
    "cd ~/catapult && nohup ./bin/run_dev_server --no-install-hooks -d ~/trace_result -p1088 &\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true,
    "hidden": true
   },
   "source": [
    "## Deploy perf analysis scripts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Create a virtualenv to run the perf analaysis scripts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "%cd ~"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!virtualenv -p python3 -v paus-env\n",
    "!source ~/paus-env/bin/activate && python3 -m pip install -r ~/gluten/tools/workload/benchmark_velox/analysis/requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We will put all perf analysis notebooks under `$HOME/PAUS`. Create the directory and start the notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!mkdir -p ~/PAUS && cd ~/PAUS && source ~/paus-env/bin/activate && nohup jupyter notebook --ip=0.0.0.0 --port=8889 &"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Package the virtual environment so that it can be distributed to other nodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "!cd ~ && tar -czf paus-env.tar.gz paus-env"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Distribute to the worker nodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "for l in clients:\n",
    "    !scp ~/paus-env.tar.gz {l}:~/\n",
    "    !ssh {l} tar -zxf paus-env.tar.gz"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
